Construction and evaluation of a practical model for measuring health-adjusted life expectancy (HALE) in China

Background HALE is now a regular strategic planning indicator at all levels of the Chinese government. However, measuring HALE requires comprehensive data collection and complex techniques, and effectively converting numerous diseases into a years lived with disability (YLD) rate is a major challenge. Our study aimed to construct a simple, widely applicable YLD rate measurement model based on the data resources actually available within China, to address the difficulty of measuring HALE target values during planning.
Methods First, taking the Chinese YLD rate in the Global Burden of Disease (GBD) 2019 as the reference, Pearson correlation analysis, the global optimum method, and related methods were used to screen the best predictor variables from current Chinese data resources. Missing data for predictor variables were filled in via spline interpolation. Then, multiple linear regression models were fitted to construct the YLD rate measurement model. The Sullivan method was used to measure HALE. The Monte Carlo method was employed to generate 95% uncertainty intervals. Finally, model performance was assessed using the mean absolute error (MAE) and mean absolute percentage error (MAPE).
Results A three-input-parameter model was constructed to measure the age-specific YLD rates by sex in China, directly using the incidence of infectious diseases and the incidence of chronic diseases among persons aged 15 and older, with the under-five mortality rate added as a covariate. The total MAE and MAPE for the combined YLD rate were 0.0007 and 0.5949%, respectively. The MAE and MAPE of the combined HALE in the 0-year-old group were 0.0341 and 0.0526%, respectively. The errors were slightly smaller for males (0.0197, 0.0311%) than for females (0.0501, 0.0755%).
Conclusion We constructed a high-accuracy model to measure the YLD rate in China using three indicators from China's routine national monitoring as predictor variables. The model provides a realistic and feasible solution for measuring HALE at the national and especially regional levels when data are limited.
Supplementary Information The online version contains supplementary material available at 10.1186/s12889-024-19112-6.


Collection of independent variables
The Global Burden of Disease (GBD) 2019 classifies 369 diseases into three major categories: the first major category (communicable, maternal, neonatal, and nutritional diseases), the second major category (non-communicable diseases), and the third major category (injuries). Using the Chinese data released in GBD 2019, we explored the relationship between the years lived with disability (YLD) rates and the total prevalence of the three major categories of diseases across 22 age groups by gender, based on multiple linear regression models. Although R² varied more in individual age groups, it was very large overall; for example, R² for the combined, male, and female all-ages groups was 0.98, 1.00, and 0.95, respectively (see Figure S1). This suggests that age-specific YLD rates are highly correlated with the overall prevalence of the three major disease categories.
Given the data resources actually available within China at present, we selected three representative indicators to represent the three major categories of diseases respectively: the incidence rate of A and B statutory infectious diseases (IID), the prevalence of chronic diseases among persons aged 15 and older (PCDPF), and the two-week prevalence of impairment poisoning (TWPIP). In addition, to improve the measurement accuracy of the model, we referred to the relevant covariates in the GBD technique, took data accessibility into account, and screened four covariates through Pearson correlation analysis: under-5 mortality rate (U5MR), educational attainment per capita (EPC), urbanization rate (UR), and gross domestic product per capita (GDPPC). Table S1 shows the correlation coefficients of China's total YLD rates for different genders with the four covariates (U5MR, EPC, UR, GDPPC).
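The Pearson screening step can be illustrated with a minimal Python sketch (the authors' own code is in R). The two series below are hypothetical values invented for illustration, not GBD or yearbook data, and the function name `pearson_r` is likewise an assumption of this sketch.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical yearly series, for illustration only.
yld_rate = [0.100, 0.102, 0.105, 0.109, 0.114]
u5mr = [0.040, 0.033, 0.025, 0.018, 0.011]

r = pearson_r(yld_rate, u5mr)
print(f"Pearson r between total YLD rate and U5MR: {r:.3f}")
```

A covariate whose coefficient is close to +1 or -1 is a candidate for inclusion; the threshold actually applied by the authors is not stated in this section.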
The seven independent variables were collected from the following sources:
(1) PCDPF and TWPIP. National Health Service Survey and Analysis Reports are available for six years (1993, 1998, 2003, 2008, 2013, 2018). In these reports, the prevalence of chronic diseases is given for all ages in 1993, 1998, 2003, and 2008, whereas for the last two survey years (2013, 2018) it is given for the 15+ age group.
(2) IID. The IID data are calculated from infectious disease surveillance data. Communicable disease surveillance covers the Centers for Disease Control and Prevention at the provincial, municipal, and county levels throughout China.
(3) U5MR

Treatment of independent variables
(1) Spline interpolation. Spline interpolation was used to fill in the missing years. The PCDPF was interpolated using cubic spline interpolation based on data from six years (1993, 1998, 2003, 2008, 2013, and 2018), and smoothed data were generated for the years 1990-2019 using the cubicspline function in the pracma package of the R language. The PCDPF by sex was filled in using the same methodology. The TWPIP was interpolated using quadratic spline interpolation, based on the same six years of data, using the spline function in the R language splines package to generate smoothed data for the years 1990-2019.
(2) Age group conversions. To standardize the criteria, we converted the all-age chronic disease prevalence in 1993, 1998, 2003, and 2008 into the chronic disease prevalence of the population aged 15 years and above, based on the age composition of the surveyed population and the chronic disease prevalence of each age group.
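The two treatments above can be sketched in Python as follows (the authors used R). Everything numeric here is hypothetical: the age bands, shares, and prevalence values are invented for illustration. The spline uses natural boundary conditions, which is an assumption of this sketch and may differ from the end conditions used by the authors' cubicspline call; years outside 1993-2018 are obtained by extending the end polynomials.

```python
def prevalence_15_plus(groups):
    """Prevalence among ages 15+, from (lower_age, population_share, prevalence) rows."""
    adults = [(share, prev) for lower, share, prev in groups if lower >= 15]
    adult_share = sum(share for share, _ in adults)
    return sum(share * prev for share, prev in adults) / adult_share


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Tridiagonal system for the second derivatives m[0..n]; m[0] = m[n] = 0.
    a = [0.0] * (n + 1)  # sub-diagonal
    b = [1.0] * (n + 1)  # diagonal
    c = [0.0] * (n + 1)  # super-diagonal
    d = [0.0] * (n + 1)  # right-hand side
    for i in range(1, n):
        a[i] = h[i - 1]
        b[i] = 2.0 * (h[i - 1] + h[i])
        c[i] = h[i]
        d[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    for i in range(1, n + 1):  # forward elimination (Thomas algorithm)
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    m = [0.0] * (n + 1)
    m[n] = d[n] / b[n]
    for i in range(n - 1, -1, -1):  # back substitution
        m[i] = (d[i] - c[i] * m[i + 1]) / b[i]

    def evaluate(x):
        # Find the interval containing x; outside the data range the first or
        # last polynomial piece is extended (extrapolation).
        i = 0
        while i < n - 1 and x > xs[i + 1]:
            i += 1
        left = xs[i + 1] - x
        right = x - xs[i]
        return (
            m[i] * left ** 3 / (6.0 * h[i])
            + m[i + 1] * right ** 3 / (6.0 * h[i])
            + (ys[i] / h[i] - m[i] * h[i] / 6.0) * left
            + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * right
        )

    return evaluate


# Step 1 (hypothetical 1993 survey): all-age data converted to the 15+ population.
groups_1993 = [(0, 0.10, 0.01), (5, 0.15, 0.02), (15, 0.50, 0.10), (45, 0.25, 0.40)]
pcdpf_1993 = prevalence_15_plus(groups_1993)

# Step 2: interpolate the six survey years to every year in 1990-2019.
survey_years = [1993, 1998, 2003, 2008, 2013, 2018]
pcdpf = [pcdpf_1993, 0.19, 0.18, 0.21, 0.26, 0.34]  # hypothetical values
spline = natural_cubic_spline(survey_years, pcdpf)
filled = {year: spline(year) for year in range(1990, 2020)}
```

The interpolated series passes exactly through the six survey values; the values for 1990-1992 and 2019 are extrapolations and are the most sensitive to the choice of boundary condition.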

Screening of independent variables
We fitted multiple linear regression models of the age-specific YLD rates published by GBD on the seven predictor variables collected and screened above (IID, PCDPF, TWPIP, U5MR, EPC, UR, GDPPC). Keeping the three variables IID, PCDPF, and TWPIP fixed and adding every possible subset of the four covariates (U5MR, EPC, UR, GDPPC) gives a total of 2^4 = 16 combinations, as shown in Table S2.
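The count of 16 candidate models follows from enumerating every subset of the four covariates. A short Python sketch of that enumeration is given below; the ordering it produces is an artifact of the sketch and does not correspond to the model numbering in Table S2.

```python
from itertools import combinations

FIXED = ("IID", "PCDPF", "TWPIP")           # always included
COVARIATES = ("U5MR", "EPC", "UR", "GDPPC")  # optional covariates

# Every subset of the covariates (including the empty subset) added to the
# three fixed predictors.
candidate_models = [
    FIXED + subset
    for k in range(len(COVARIATES) + 1)
    for subset in combinations(COVARIATES, k)
]

for index, predictors in enumerate(candidate_models, start=1):
    print(index, ", ".join(predictors))
```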
We performed multicollinearity diagnostics on these 16 models and excluded models in which any independent variable had a variance inflation factor (VIF) greater than 10, leaving four models: Model 12, Model 13, Model 14, and Model 16. Model 12 was finally selected according to three criteria: smaller errors in the YLD rate and HALE, fewer independent variables, and the greatest accessibility of the independent variables at the provincial, city, and county levels (the calculation of the YLD rate and HALE is described later). Model 12 retains four independent variables (IID, PCDPF, TWPIP, and U5MR); see Table S3. To further enhance the applicability of the model while preserving its accuracy, the main text presents the 3-parameter model (IID, PCDPF, and U5MR), and the 4-parameter model (IID, PCDPF, TWPIP, and U5MR) follows below. The entire screening process was carried out in R, and the code is available at https://github.com/sanqian-chen/Filtering-Variables.
The maximum, minimum, median, and mean of the two-week prevalence of impairment poisoning were 0.0059, 0.0038, 0.0047, and 0.0048, respectively. The corresponding statistics for the remaining three independent variables (PCDPF, IID, and U5MR) are shown in Table 1 in the main text.
The model takes four indicators (IID, PCDPF, U5MR, and TWPIP) as input to obtain the YLD rates for the 21 age groups, as follows:
(1) Twenty-one multiple linear regression models were constructed with the four rate indicators as independent variables and the YLD rate of each of the 21 age groups as the dependent variable:
$$\Gamma(Y_i) = \alpha_i + \beta_{1i} X_1 + \beta_{2i} X_2 + \beta_{3i} X_3 + \beta_{4i} X_4, \quad i = 1, \dots, 21$$
where $Y_i$ is the YLD rate for each of the 21 age groups (0-year-old group, 1-4-year-old group, 5-9-year-old group, 10-14-year-old group, ..., 90-94-year-old group, 95+ year-old group); $X_1$, $X_2$, $X_3$, and $X_4$ are the IID, PCDPF, U5MR, and TWPIP, respectively; $\alpha_i$ are the intercepts; $\beta_{1i}$, $\beta_{2i}$, $\beta_{3i}$, and $\beta_{4i}$ are the regression coefficients of the corresponding independent variables; and $\Gamma(\cdot)$ denotes the logit transformation.
In addition, this study compared three options for $Y_i$: raw values, Ln transformation, and logit transformation. The logit transformation was finally chosen for two reasons. First, it gave a smaller error. Second, a model on the raw scale may produce negative values when extrapolating in prediction, whereas values back-transformed from the logit scale always lie between 0 and 1.
(2) In constructing the gender-specific model (5), the both-sex rate was used directly in place of the gender-specific rate for the three independent variables other than the PCDPF.
The model was constructed using R. The code is available at https://github.com/sanqian-chen/Filtering-Variables.
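To illustrate why the logit scale keeps predictions in a valid range, the following Python sketch applies the inverse logit to a linear predictor. The intercept, coefficients, and input rates are hypothetical placeholders, not the fitted values of the authors' model, and the predictors are entered untransformed as an assumption of this sketch.

```python
import math


def logit(p):
    """Logit transformation of a rate p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse logit; the result always lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical intercept and coefficients for one age group.
intercept = -2.0
coefficients = [0.5, 1.2, -0.8, 0.3]
# Hypothetical inputs in the order IID, PCDPF, U5MR, TWPIP.
inputs = [0.003, 0.25, 0.01, 0.005]

linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, inputs))
yld_rate = inv_logit(linear_predictor)
print(f"Predicted YLD rate: {yld_rate:.4f}")
```

However far the linear predictor is extrapolated, the back-transformed rate cannot become negative or exceed 1, which a regression on raw values does not guarantee.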

Estimation of uncertainty intervals
We assessed the uncertainty of the model coefficients, YLD rate, and HALE estimates based on the uncertainty of the input data from GBD. The 95% uncertainty intervals (UI) were generated with the Monte Carlo method by sampling from a multivariate normal distribution constructed from the GBD YLD rates and their 95% UI.
We assumed that the GBD data are normally distributed with mean values equal to the point estimates, and that the GBD YLD rates for each year, after logit transformation, follow a 21-dimensional multivariate normal distribution. For each year and sex we proceeded as follows:
1) The YLD rates and their 95% UI for different genders and 21 age groups from 1990-2019 in the GBD database were logit transformed to construct a multivariate normal distribution. The YLD rate and 95% UI referred to in this subsection are the values after logit transformation.
2) We constructed a multivariate normal distribution for each of the years 1990-2019, and from each distribution drew 1000 values of $\Gamma(Y_i)$ simultaneously for the 21 age groups. The multivariate normal distribution requires two parameters: a vector of means and a 21-dimensional covariance matrix.
3) The mean vector is the YLD rate for each of the 21 age groups for each year.
4) The construction of the 21-dimensional covariance matrix is divided into 3 parts:
① Each age group's covariance with itself, i.e., its variance. We back-calculated two standard deviations from the GBD YLD rate and its 95% uncertainty interval based on the "$\mu \pm 1.96 \times \sigma$" formula, one from the lower bound and one from the upper bound:
$$\sigma_{1} = \frac{\mu - L}{1.96}, \qquad \sigma_{2} = \frac{U - \mu}{1.96}$$
where $\mu$ is the YLD rate of the age group and $L$ and $U$ are the lower and upper bounds of its 95% UI.

Calculation of indicators of error
In this study, we evaluated the accuracy of this simple model in two dimensions: the YLD rate and the HALE. The YLD rate and HALE published by GBD were used as the true reference values, and the error between the values calculated by our model and these reference values was estimated. The evaluation methodology is consistent with the main text and is not repeated here.
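For reference, the three error indicators reported in the supplementary tables (MAE, RMSE, and MAPE) can be computed as in the following Python sketch, using their standard definitions. The predicted and reference values are hypothetical numbers chosen for illustration.

```python
import math


def mae(pred, ref):
    """Mean absolute error."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def rmse(pred, ref):
    """Root mean square error."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def mape(pred, ref):
    """Mean absolute percentage error, in %."""
    return 100.0 * sum(abs(p - r) / abs(r) for p, r in zip(pred, ref)) / len(ref)


# Hypothetical HALE values in years: reference values and model-fitted values.
reference = [70.0, 60.0, 50.0]
fitted = [70.5, 59.5, 50.5]

print(mae(fitted, reference), rmse(fitted, reference), mape(fitted, reference))
```

MAE and RMSE carry the unit of the quantity being evaluated (years for HALE, dimensionless for the YLD rate), whereas MAPE is expressed relative to the reference value.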
PCDPF and TWPIP were derived from the National Health Service Survey and Analysis Report. IID and U5MR are from the China Health and Wellness Statistical Yearbook. EPC, UR, and GDPPC are from the China Statistical Yearbook.

Figure S1 R² of multiple linear regression models of age-specific YLD rates and total prevalence of three major disease categories in China

Figure S2 Residuals of YLD rates by sex and age (4 independent variables)

Figure S5 Comparison of model-fitted values and true reference values of HALE for different age groups, 1990-2019 (3 Variables, Both)

Table S1 Correlation coefficients of total YLD rate with four covariates across gender in China

residents every five years and published that year's National Health Service Survey and Analysis Report. Currently, the National Health Service Survey and Analysis

To effectively manage and control infectious diseases while safeguarding public health and safety, it is critical to promptly implement suitable preventive and control measures. In China, infectious diseases are classified into statutory infectious diseases, infectious diseases necessitating key monitoring, and other infectious diseases.
Statutory infectious diseases are those that must be reported and controlled in accordance with procedures stipulated by the Law of the People's Republic of China on Prevention and Control of Infectious Diseases. These diseases are highly contagious and harmful, posing a significant threat to public health security. Statutory infectious diseases are categorized into three major groups, namely A, B, and C, with 41 diseases. There are two statutory infectious diseases in category A, namely plague and cholera. There are 28 statutory infectious diseases in category B, namely, infectious atypical pneumonia, human infection with H7N9 avian influenza, AIDS, viral hepatitis, poliomyelitis, human infection with highly pathogenic avian

Table S2 Variance Inflation Factors for the Dependent Variables of the 16

Construction of the YLD rate model and parameter estimation
3.1 Simple modeling for measuring YLD rates by sex and age in China
The mean of the two standard deviations was then squared to give each age group's own variance (its covariance with itself) for each year.
② Covariance and correlation coefficients for each pair of age groups. From the 1990-2019 YLD rates we constructed 21 samples of size 30, one for each of the 21 age groups, and then calculated the covariance and correlation coefficient for every pair of age groups.
5) The 30 mean vectors and covariance matrices obtained were used to construct 30 multivariate normal distributions, each with 21 age groups. From each multivariate normal distribution, 1000 values of $\Gamma(Y_i)$ were drawn simultaneously for all age groups. That is, we have 30 years of data, each with 21 age groups, and 1000 values of $\Gamma(Y_i)$ for each age group.
6) Model (1) was fitted to these data with the three independent variables to obtain 1000 copies of model (1), and the 95% UI of the model coefficients (the intercept and the three regression coefficients) were obtained as the 2.5th and 97.5th percentiles.
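The sampling procedure can be sketched in Python for a single year (the authors used the mvrnorm function in R). This sketch rests on several assumptions that should not be read as the authors' exact method: it uses three age groups instead of 21, hypothetical point estimates, UIs, and correlation matrix, it takes the variance as the square of the mean of the two back-calculated standard deviations, and it builds the covariance matrix by rescaling a fixed correlation matrix. It also stops at the percentiles of the draws and omits the refitting of the regression on each draw.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def cholesky(a):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    below = int(math.floor(pos))
    above = min(below + 1, len(s) - 1)
    return s[below] + (s[above] - s[below]) * (pos - below)


# Hypothetical YLD rates and 95% UIs for three age groups in one year.
point = [0.050, 0.080, 0.150]
lower = [0.040, 0.065, 0.120]
upper = [0.062, 0.097, 0.185]
# Hypothetical correlation matrix between age groups, assumed constant over years.
corr = [[1.0, 0.8, 0.6], [0.8, 1.0, 0.7], [0.6, 0.7, 1.0]]

# Mean vector and standard deviations on the logit scale.
mean = [logit(p) for p in point]
sd = []
for p, low_p, high_p in zip(point, lower, upper):
    sd_from_lower = (logit(p) - logit(low_p)) / 1.96
    sd_from_upper = (logit(high_p) - logit(p)) / 1.96
    sd.append((sd_from_lower + sd_from_upper) / 2.0)

k = len(mean)
cov = [[corr[i][j] * sd[i] * sd[j] for j in range(k)] for i in range(k)]
chol = cholesky(cov)

# 1000 joint draws: mean + L z, with z a vector of independent standard normals.
rng = random.Random(2019)
draws = []
for _ in range(1000):
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    draws.append([mean[i] + sum(chol[i][j] * z[j] for j in range(i + 1)) for i in range(k)])

# 95% UI of each age group's logit YLD rate from the 2.5th and 97.5th percentiles.
ui = [
    (percentile([d[i] for d in draws], 2.5), percentile([d[i] for d in draws], 97.5))
    for i in range(k)
]
```

Drawing all age groups jointly, rather than one at a time, preserves the correlation between neighbouring age groups in every simulated YLD profile.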
③ Construct the 21-dimensional covariance matrix for each year. We assumed that the covariances and correlation coefficients between pairs of age groups remained constant over time, and changed only the variances on the main diagonal to produce the 21-dimensional covariance matrix for each year.
The mvrnorm function in the MASS package of the R language was used to draw the 1000 random samples from each multivariate normal distribution. The code is available at https://github.com/sanqian-chen/Filtering-Variables/tree/Constructionmodel.

4. Assessment of model error
4.1 Sullivan method of measuring HALE
Like life expectancy, HALE is not affected by the age structure of the population.
$$\mathrm{HALE}_x = \frac{\sum_{i=x}^{w} L_i \, (1 - Y_i)}{l_x}$$
where $l_x$ denotes the number of survivors in age group $x$, $w$ denotes the largest age group (the 95+ year group), $L_i$ denotes the number of survivor years in age group $i$, and $Y_i$ is the age-specific YLD rate measured by model (6). $l_x$ and $L_i$ are both derived from the life table data in GBD 2019.
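A minimal Python sketch of the Sullivan calculation is given below: the survivor years of each age group are discounted by one minus its YLD rate, accumulated from the oldest age group downwards, and divided by the number of survivors entering the age group. The three-group life table is a toy example with hypothetical numbers, not GBD data.

```python
def sullivan_hale(survivors, person_years, yld_rates):
    """HALE for each age group by the Sullivan method.

    survivors[i]    -- number of survivors entering age group i
    person_years[i] -- survivor years lived in age group i
    yld_rates[i]    -- age-specific YLD rate of age group i
    """
    healthy_years = [L * (1.0 - y) for L, y in zip(person_years, yld_rates)]
    hale = []
    remaining = 0.0
    # Accumulate healthy years from the oldest age group downwards.
    for i in range(len(survivors) - 1, -1, -1):
        remaining += healthy_years[i]
        hale.append(remaining / survivors[i])
    hale.reverse()
    return hale


# Toy life table with three age groups (hypothetical values).
survivors = [1000.0, 800.0, 400.0]
person_years = [9000.0, 6000.0, 2000.0]
yld_rates = [0.1, 0.2, 0.5]

hale = sullivan_hale(survivors, person_years, yld_rates)
life_expectancy = sullivan_hale(survivors, person_years, [0.0, 0.0, 0.0])
```

Setting every YLD rate to zero recovers ordinary life expectancy, so the gap between the two results is the expected number of years lost to disability.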

Table S5 Results of error assessment of YLD rates by gender and age (4 independent variables)
MAE, mean absolute error; RMSE, root mean square error; MAPE, mean absolute percentage error. MAE and RMSE are in units of 1 and MAPE is in %.

Table S6 Results of error assessment for HALE by gender and age group (4 independent variables)
MAE, mean absolute error; RMSE, root mean square error; MAPE, mean absolute percentage error. MAE and RMSE are in years and MAPE is in %.